transformer and convolutional neural network
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is concluded to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, that is, the spatial relevance and diverse channel representation. First, on spatial aspect, objects are locally compact and relevant, thus fine-grained feature needs to be extracted from a token and its neighbors. While the lack of data hinders ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels.
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is concluded to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, that is, the spatial relevance and diverse channel representation. First, on spatial aspect, objects are locally compact and relevant, thus fine-grained feature needs to be extracted from a token and its neighbors. While the lack of data hinders ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels.
Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks
Panda, Mahadev Prasad, Tiezzi, Matteo, Vilas, Martina, Roig, Gemma, Eskofier, Bjoern M., Zanca, Dario
We introduce Foveation-based Explanations (FovEx), a novel human-inspired visual explainability (XAI) method for Deep Neural Networks. Our method achieves state-of-the-art performance on both transformer (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14\% in NSS compared to RISE, +203\% in NSS compared to gradCAM), enhancing our confidence in FovEx's ability to close the interpretation gap between humans and machines.
- Europe > Germany > Hesse > Darmstadt Region > Frankfurt (0.05)
- Europe > Italy (0.04)
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- (2 more...)
Visual Transformers and Convolutional Neural Networks for Disease Classification on Radiographs: A Comparison of Performance, Sample Efficiency, and Hidden Stratification
To compare performance, sample efficiency, and hidden stratification of visual transformer (ViT) and convolutional neural network (CNN) architectures for diagnosis of disease on chest radiographs and extremity radiographs using transfer learning. Performance was assessed on internal test sets and 75 000 external chest radiographs (three datasets). The primary comparison was DeiT-B ViT vs DenseNet121 CNN; secondary comparisons included DeiT-Ti (Tiny), ResNet152, and EfficientNetB7. Sample efficiency was evaluated by training models on varying dataset sizes. Hidden stratification was evaluated by comparing prevalence of chest tubes in pneumothorax false-positive and false-negative predictions and specific abnormalities for MURA false-negative predictions.
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)